Talking Points


Threaded GL API dispatch
• Concept
• Implementation details
• Making it fast
• Making it faster


Missing relevant features in OpenGL

Note the Footnote 


Application makes API calls
• Store function IDs and arguments in a buffer
• Don't execute the actual function
• Return control to the application

Have a secondary thread do the real work
• Retrieve function IDs and args from the buffer
• Execute the actual function

...as long as postponing the side effects is fine

“Threaded”¹ refers to offloading the work to another thread

¹ “threaded dispatch” usually refers to a certain design of an interpreter loop
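
The record-then-replay idea above can be sketched in a few lines of C. This is a minimal illustration only: the function names, IDs, and the flat buffer are made up for the example, and the replay loop runs on the same thread so the deferred side effects are easy to observe. In a threaded design the loop would run on the worker.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "real" functions standing in for driver entrypoints. */
static float last_x, last_y;
static int   flushes;
static void real_set_point(float x, float y) { last_x = x; last_y = y; }
static void real_flush(void) { flushes++; }

enum { ID_SET_POINT = 1, ID_FLUSH = 2 };   /* made-up function IDs */

/* Flat command buffer: 32-bit function ID followed by raw arguments. */
static unsigned char cmdbuf[1024];
static size_t wr, rd;

static void put(const void *p, size_t n) {
    assert(wr + n <= sizeof cmdbuf);
    memcpy(cmdbuf + wr, p, n);
    wr += n;
}
static void get(void *p, size_t n) {
    assert(rd + n <= wr);
    memcpy(p, cmdbuf + rd, n);
    rd += n;
}

/* Producer side: record the call and return immediately. */
static void set_point(float x, float y) {
    uint32_t id = ID_SET_POINT;
    put(&id, sizeof id); put(&x, sizeof x); put(&y, sizeof y);
}
static void flush(void) {
    uint32_t id = ID_FLUSH;
    put(&id, sizeof id);
}

/* Consumer side: decode each record and execute the real function.
   Returns the number of calls replayed. */
static int replay(void) {
    int n = 0;
    while (rd < wr) {
        uint32_t id;
        get(&id, sizeof id);
        if (id == ID_SET_POINT) {
            float x, y;
            get(&x, sizeof x); get(&y, sizeof y);
            real_set_point(x, y);
        } else if (id == ID_FLUSH) {
            real_flush();
        } else {
            assert(!"unknown function ID");
        }
        n++;
    }
    return n;
}
```

Calling set_point(1.0f, 2.0f) and flush() changes nothing until replay() runs, which is exactly the postponed side effect the slide warns about.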
Not That Easy


You can’t naively make an API call asynchronously when it
• ...returns a value
• ...dereferences pointers into application memory
  • pointer given in arguments
  • pointer escaped via previous calls
  • ...unless async behavior allowed by the spec (glArrayElement)
• ...specified to have a synchronizing effect (glFinish)
• ...just better be synchronous (glXSwapBuffers)

Solutions:
• Synchronize (stall until the secondary thread catches up):
  big hammer, always works
• If API call needs a const pointer to a small array, just copy it
• Use API semantics to your advantage in other ways

No Silver Bullet 


Won't buy you anything if the application is
• ...100% GPU bound
• ...100% CPU bound, all outside the driver:
  not helping the bottleneck
• ...100% CPU bound, all in the driver:
  moving the bottleneck to another thread

Ideal case:
• CPU bound, 50% in GL driver on the critical path
• No API calls causing synchronization stalls

Ideal theoretical speedup is “about 2x”
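
The “about 2x” figure follows from a simple overlap model. This is a back-of-the-envelope sketch, not a measurement. Assume a CPU-bound frame costs time T on one thread, a fraction f of it is spent inside the GL driver, and the driver part overlaps perfectly with the application part (no sync stalls, no dispatch overhead). The symbols T, f and S are introduced here for illustration only.

```latex
T_{\text{threaded}} = \max(f,\, 1 - f)\, T,
\qquad
S(f) = \frac{T}{T_{\text{threaded}}} = \frac{1}{\max(f,\, 1 - f)}
\\
S(0) = S(1) = 1, \qquad S(0.5) = 2
```

Under these assumptions the 100% cases above give no speedup, and the even 50/50 split gives the maximum of 2.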


Not Exactly New 


Been done before:
• NVIDIA: __GL_THREADED_OPTIMIZATIONS, 2012
  (years after Windows driver got “Multicore Optimizations”)
• Mesa: anholt/glthread-5 branch

What's going to be new here:
• Standalone, vendor-independent
• Will come with a stall profiler


Principles of Operation 


To perform threaded offload, one needs:
• Secondary worker threads
• Mechanism to pass API call args
• Synchronization mechanism
• Producer/consumer stubs for each GL entrypoint


One worker thread for each application thread touching GL/GLX
• 1-1 producer-consumer correspondence
• Never touch libGL from original application threads
• When to spawn:
  • In GLX calls, spawn a worker if it doesn't exist yet
  • In GL calls, no need to care
• When to clean up:
  • When the corresponding application thread exits
    (using pthread_key_create)

Tried and discarded another approach:
• Spawn one worker per active context
• Turns out the NVIDIA driver gets slower, with
  pthread_mutex_unlock high in perf profiles
  • Presumably it attempts to protect internal data structures with
    mutexes when multithreaded, even with one context
  • Exact logic is unclear
• Need to dlopen NVIDIA libGL from the worker thread as well!

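
The cleanup hook mentioned above (pthread_key_create) can be sketched as follows. This is an assumption-laden illustration, not the actual implementation: struct thread_ctx, ctx_get and the live counter are hypothetical names, and a real layer would stop the worker thread and release its ring buffer in the destructor instead of only freeing a small struct.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-application-thread state; a real layer would also hold
   the worker thread handle and its ring buffer here. */
struct thread_ctx { int calls; };

static pthread_key_t   ctx_key;
static pthread_once_t  ctx_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t live_lock = PTHREAD_MUTEX_INITIALIZER;
static int live_contexts;               /* guarded by live_lock */

/* Runs automatically when a thread that set ctx_key exits. */
static void ctx_destroy(void *p) {
    pthread_mutex_lock(&live_lock);
    live_contexts--;
    pthread_mutex_unlock(&live_lock);
    free(p);                            /* real layer: stop the worker here */
}

static void ctx_key_init(void) {
    int rc = pthread_key_create(&ctx_key, ctx_destroy);
    assert(rc == 0);
    (void)rc;
}

/* Called from every GLX-like entrypoint: create the context lazily. */
static struct thread_ctx *ctx_get(void) {
    pthread_once(&ctx_once, ctx_key_init);
    struct thread_ctx *ctx = pthread_getspecific(ctx_key);
    if (!ctx) {
        ctx = calloc(1, sizeof *ctx);
        assert(ctx);
        pthread_mutex_lock(&live_lock);
        live_contexts++;
        pthread_mutex_unlock(&live_lock);
        pthread_setspecific(ctx_key, ctx);
    }
    ctx->calls++;
    return ctx;
}

static void *app_thread(void *arg) {
    (void)arg;
    struct thread_ctx *a = ctx_get();
    struct thread_ctx *b = ctx_get();   /* second call reuses the context */
    assert(a == b && a->calls == 2);
    return NULL;                        /* destructor fires during thread exit */
}

/* Demonstration helper: run one application thread to completion and
   report how many contexts are still alive afterwards. */
static int run_one_thread(void) {
    pthread_t t;
    int rc = pthread_create(&t, NULL, app_thread, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    pthread_mutex_lock(&live_lock);
    int n = live_contexts;
    pthread_mutex_unlock(&live_lock);
    return n;
}
```

Thread-specific data destructors run as part of thread exit, so by the time pthread_join returns the context has already been released.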

One ring buffer for each producer-consumer pair
• Size/align 4MB/4MB: get a hugepage if lucky
• Data layout just natural:
  • Function ID followed by arguments
  • Variable-length arrays preceded by length
  • Primitive types aligned to their size
• Prescribe maximum argument size (e.g. 16K)
  • Useful to keep small glBufferSubData calls async
  • For larger sizes, make a synchronous call without copying
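
The layout rules above can be sketched as a small encoder. Everything here is illustrative: the helper names, the function ID 7, the 32-bit length prefix, and the 8-byte record alignment are assumptions made for the example, not details taken from the slides. Wraparound and overflow handling are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ARG_SIZE (16u * 1024u)   /* "e.g. 16K" from the slide */

/* Round an offset up to a power-of-two alignment. */
static size_t align_up(size_t off, size_t align) {
    return (off + align - 1) & ~(align - 1);
}

/* Append a primitive value, aligned to its own size; returns the new offset.
   Assumes size is a power of two. */
static size_t put_prim(unsigned char *buf, size_t off, const void *v, size_t size) {
    off = align_up(off, size);
    memcpy(buf + off, v, size);
    return off + size;
}

/* Append a variable-length array preceded by its 32-bit byte length. */
static size_t put_array(unsigned char *buf, size_t off, const void *data, uint32_t len) {
    off = put_prim(buf, off, &len, sizeof len);
    memcpy(buf + off, data, len);
    return off + len;
}

/* Payloads up to the prescribed maximum are copied and stay asynchronous;
   larger ones would take the synchronous path without copying. */
static int can_stay_async(size_t len) {
    return len <= MAX_ARG_SIZE;
}

/* Encode a hypothetical sub-data record: ID, 64-bit offset, byte array. */
static size_t encode_sub_data(unsigned char *buf, uint64_t offset,
                              const void *data, uint32_t len) {
    uint32_t id = 7;                                   /* made-up function ID */
    size_t off = 0;
    off = put_prim(buf, off, &id, sizeof id);          /* bytes 0..3 */
    off = put_prim(buf, off, &offset, sizeof offset);  /* padded to bytes 8..15 */
    off = put_array(buf, off, data, len);              /* length at bytes 16..19 */
    return align_up(off, 8);         /* keep the next record 8-byte aligned */
}
```

Encoding a 3-byte payload this way yields a 24-byte record: 4 bytes of ID, 4 of padding, 8 of offset, 4 of length, 3 of data, and 1 of trailing padding.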


Synchronization 


Threads occasionally need to suspend:
• Consumer: ring buffer empty
• Producer: ring buffer may overflow on next call
• Producer: when making a synchronous call

When one suspends, the other needs to wake it

Approach taken:
• For producer and consumer, maintain
  • Current pointer into ring buffer
  • “Suspended” flag
• Suspend/wakeup:
  • Futex operations on pointers
  • Fits almost² perfectly
• Consumer: sched_yield() a few times before suspending

² needs endian-dependent hacks
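
A Linux-only sketch of the futex-based suspend/wakeup follows. It is simplified on purpose: it waits on a 32-bit position counter rather than a real pointer, it omits the “suspended” flag and the sched_yield() spinning, and the names write_pos, consumer_wait and producer_publish are hypothetical. A futex word is always 32 bits wide, so waiting on a 64-bit pointer means picking one half of it, and which address holds the low half depends on byte order; that is presumably the endian-dependent hack the footnote refers to.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrappers: glibc exposes no futex() function, so use syscall(2). */
static long futex_wait(_Atomic uint32_t *addr, uint32_t expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}
static long futex_wake(_Atomic uint32_t *addr, int count) {
    return syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, count, NULL, NULL, 0);
}

/* Hypothetical shared state: write position published by the producer. */
static _Atomic uint32_t write_pos;

/* Consumer: sleep until write_pos differs from the position already consumed.
   The kernel checks *addr == expected atomically before sleeping, so a
   publish that races with the sleep is not lost: the wait fails with EAGAIN
   and the loop re-reads the position. */
static uint32_t consumer_wait(uint32_t read_pos) {
    uint32_t w;
    while ((w = atomic_load(&write_pos)) == read_pos) {
        long rc = futex_wait(&write_pos, read_pos);
        assert(rc == 0 || errno == EAGAIN || errno == EINTR);
        (void)rc;
    }
    return w;
}

/* Producer: publish the new position, then wake one sleeping consumer. */
static void producer_publish(uint32_t new_pos) {
    atomic_store(&write_pos, new_pos);
    futex_wake(&write_pos, 1);
}
```

Issuing a wake on every publish costs a system call each time; tracking a “suspended” flag, as the slide lists, lets the producer skip the wake when the consumer is known to be running.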
Need two stubs for each GL API entrypoint
• Almost 3000 functions (counting all extensions)
• Must have automatic codegen

Need formal API specs to do codegen
• Old GL specs: incomplete, deprecated
• New GL specs
  • XML
  • Not informative enough
• APITrace specs: very nice
Function(ASYNC, Void, glVertex2f, ((GLfloat, x), (GLfloat, y))) 


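void glVertex2f(GLfloat x, GLfloat y)
{
    PFUNC(glVertex2f);          /* save function ID */
    PUT(x);                     /* save args */
    PUT(y);
    PDONE;                      /* store ring buffer pointer, handle overflow */
}


static void worker_glVertex2f(void)
{
    GLfloat x;
    GLfloat y;

    CFUNC(glVertex2f);
    GET(x);                     /* load args */
    GET(y);
    CDONE;                      /* advance ring buffer pointer */
    CNEXT(glVertex2f)(x, y);    /* call into vendor libGL */
}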


Producer Stub Assembly 


glVertex2f:
    # Get thread-specific context (cheat: IE TLS)
    movq current@gottpoff(%rip), %rax
    movq %fs:(%rax), %rdi

    # Get ring buffer pointer
    movq 256(%rdi), %rsi

    # Save Function ID
    movl $216, (%rsi)

    # Advance ring buffer pointer
    leaq 16(%rsi), %rdx

    # Save args
    movss %xmm0, 4(%rsi)
    movss %xmm1, 8(%rsi)

    # Store ring buffer pointer and handle overflow
    jmp producer_advance
Consumer Stub Assembly


worker_glVertex2f:
    # Load args
    movss 4(%rbx), %xmm0
    movss 8(%rbx), %xmm1
    # Advance ring buffer pointer
    leaq 16(%rbx), %rbx
    # Jump to vendor libGL
    jmp *%rax


Workers are very small thanks to custom ABI.
Use return register (rax) for driver function pointer
Use callee-saved registers (rbx, r15) for
• Ring buffer pointer
• Current context data (very rarely needed)


Only a matter of 3 global register vars (GCC extension)
Stall Profiler


Producer side can output stall timing statistics:


41 fps
92.1 syncs per frame
0 waits per frame (due to overflow)

sync: 78.2%
wait: 0%

glXSwapBuffers:              41   88.6%
glGetIntegerv:             1447   6.85%
glCheckFramebufferStatus:  1406   2.82%
glMapBufferRange:           592   1.02%
glBufferData:               143   0.326%
glTexImage3D:                 5   0.124%
glGetError:                  41   0.057%
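
Per-function statistics like these can be gathered with very little code. The sketch below is a guess at the bookkeeping, not the profiler from the talk: the function IDs, stall_record, timed_sync and the report format are hypothetical, and it assumes the percentages are each function's share of total stall time, which the slide does not state explicitly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

enum { FN_SWAP, FN_GET_INTEGER, FN_COUNT };   /* hypothetical function IDs */
static const char *fn_names[FN_COUNT] = { "glXSwapBuffers", "glGetIntegerv" };

static struct { uint64_t calls, ns; } stall[FN_COUNT];

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Account one stall of the given duration to a function. */
static void stall_record(int fn, uint64_t ns) {
    stall[fn].calls++;
    stall[fn].ns += ns;
}

/* Wrap a blocking wait; wait_fn stands in for "stall until the worker
   catches up" on the producer side. */
static void timed_sync(int fn, void (*wait_fn)(void)) {
    uint64_t t0 = now_ns();
    wait_fn();
    stall_record(fn, now_ns() - t0);
}

/* Share of total stall time attributed to one function, in percent. */
static double stall_share(int fn) {
    uint64_t total = 0;
    for (int i = 0; i < FN_COUNT; i++)
        total += stall[i].ns;
    return total ? 100.0 * (double)stall[fn].ns / (double)total : 0.0;
}

/* Print one line per function: name, call count, share of stall time. */
static void stall_report(FILE *out) {
    for (int i = 0; i < FN_COUNT; i++)
        fprintf(out, "%s: %llu %.3g%%\n", fn_names[i],
                (unsigned long long)stall[i].calls, stall_share(i));
}
```

Only the synchronous path pays for the two clock reads, so the asynchronous fast path stays untouched.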
Fake It Till You Make It


Fast offload not useful if you sync all the time
• Chances are, you will...
• ...unless the application was heavily optimized with driver
  threading in mind
• Want some way to forgo syncs when possible


Ways to avoid thread syncs:
• Guess and hope for the best
  • glGetError() { return GL_NO_ERROR; }
  • glCheckFramebufferStatus(): likewise
• Try to track some GL state
  • Intercept glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo)
  • Answer glGetIntegerv(GL_DRAW_FRAMEBUFFER_BINDING) queries
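
The two tricks above might look roughly like this. It is a sketch under stated assumptions: the wrapped_* names and the enqueue/sync hooks are hypothetical, the GL types and constants are declared locally so the block compiles without GL headers, and the shadow state is a single global rather than per-context.

```c
#include <assert.h>

typedef unsigned int GLenum;
typedef unsigned int GLuint;
typedef int GLint;

/* Values as defined in the OpenGL headers. */
#define GL_NO_ERROR                  0
#define GL_DRAW_FRAMEBUFFER_BINDING  0x8CA6
#define GL_READ_FRAMEBUFFER          0x8CA8
#define GL_DRAW_FRAMEBUFFER          0x8CA9
#define GL_READ_FRAMEBUFFER_BINDING  0x8CAA
#define GL_FRAMEBUFFER               0x8D40

static GLuint shadow_draw_fbo, shadow_read_fbo;  /* per-context in a real layer */
static int async_calls, sync_calls;              /* counters for demonstration */

/* Hypothetical hooks into the dispatch layer. */
static void enqueue_bind_framebuffer(GLenum target, GLuint fbo) {
    (void)target; (void)fbo;
    async_calls++;
}
static void sync_get_integerv(GLenum pname, GLint *out) {
    (void)pname;
    *out = 0;
    sync_calls++;
}

/* Guess and hope for the best: never stall for the error state. */
static GLenum wrapped_glGetError(void) { return GL_NO_ERROR; }

/* Remember the binding locally, then forward the call asynchronously. */
static void wrapped_glBindFramebuffer(GLenum target, GLuint fbo) {
    if (target == GL_DRAW_FRAMEBUFFER || target == GL_FRAMEBUFFER)
        shadow_draw_fbo = fbo;
    if (target == GL_READ_FRAMEBUFFER || target == GL_FRAMEBUFFER)
        shadow_read_fbo = fbo;
    enqueue_bind_framebuffer(target, fbo);
}

/* Answer tracked queries from the shadow copy; stall for everything else. */
static void wrapped_glGetIntegerv(GLenum pname, GLint *out) {
    switch (pname) {
    case GL_DRAW_FRAMEBUFFER_BINDING: *out = (GLint)shadow_draw_fbo; break;
    case GL_READ_FRAMEBUFFER_BINDING: *out = (GLint)shadow_read_fbo; break;
    default: sync_get_integerv(pname, out); break;
    }
}
```

The shadow copy is wrong whenever the real bind fails (for example with an invalid framebuffer name), which is one concrete way such tracking can deviate from the GL spec.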
glMapBufferRange(target, offset, length,
                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT)
shouldn't sync, right?
• Give data = malloc(length) to the application
• Remember (offset, length, data) for target
• When application calls glUnmapBuffer:
  • glBufferSubData(target, offset, length, data)
  • free(data)


Only do it if length is small enough
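
The map/unmap transformation above can be sketched without a GL context by standing in a plain array for the buffer object. All names here (wrapped_map_range, wrapped_unmap, enqueue_buffer_sub_data, SHADOW_MAX) are hypothetical, only one pending mapping is tracked instead of one per target, and the 4096-byte threshold is an arbitrary choice for “small enough”.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define GL_MAP_WRITE_BIT           0x0002
#define GL_MAP_UNSYNCHRONIZED_BIT  0x0020
#define SHADOW_MAX 4096            /* hypothetical "small enough" threshold */

/* Stand-in for the real buffer object so the sketch runs without GL. */
static unsigned char fake_store[8192];
static void enqueue_buffer_sub_data(size_t offset, size_t length, const void *data) {
    memcpy(fake_store + offset, data, length);  /* real layer: copy into ring */
}

/* One pending shadow mapping (a real layer would track one per target). */
static struct { size_t offset, length; void *data; } shadow;

/* Returns a malloc'ed shadow region, or NULL when the caller must fall back
   to a synchronous glMapBufferRange. */
static void *wrapped_map_range(size_t offset, size_t length, unsigned access) {
    unsigned want = GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT;
    if (access != want || length > SHADOW_MAX || shadow.data)
        return NULL;
    shadow.data = malloc(length);
    if (!shadow.data)
        return NULL;
    shadow.offset = offset;
    shadow.length = length;
    return shadow.data;
}

/* Returns 1 if a shadow mapping was flushed, 0 if there was none. */
static int wrapped_unmap(void) {
    if (!shadow.data)
        return 0;
    enqueue_buffer_sub_data(shadow.offset, shadow.length, shadow.data);
    free(shadow.data);
    shadow.data = NULL;
    return 1;
}
```

One caveat worth keeping in mind: if the application writes only part of the mapped range, the untouched bytes of the malloc'ed region are indeterminate and will still overwrite the buffer contents on unmap, so this is a behavior change, not a transparent optimization.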
Tangle and Mangle


Contradicting goals
• Threaded dispatch
  • Simple 1:1 call mapping
  • Low overhead
• Sync avoidance:
  • Do some tracking: not free
  • Call transformations: plenty of room for error

Completely separate in two libraries:
• tangl: pure threaded dispatch
  • Simple, correct, fast
  • Good enough for “well-behaved” applications
• mangl: call transformation
  • All kinds of questionable hacks for sync avoidance
  • Plenty of room for error
  • Ability to deviate from GL spec (should be configurable)
  • Adds overhead




Enabling asynchronous memory access in the driver

No way in core GL to say:
• Here’s a memory range in the application address space
• I promise I won't modify or unmap it
• Therefore the driver may access it asynchronously

Example use case:
• mmap a resource file
• glTexImage from mmap’ed range
• glFenceSync
• do something else
• glClientWaitSync
• munmap
or glReadPixels/glGetBufferSubData into a prescribed buffer

Actually this was done as extensions:
• GL_SGIX_async, 1998
• GL_NV_pixel_data_range, 2002

Why not in main spec?
Missing Pieces II: Fence Callbacks


No way to register a user function for fence completion
• Callbacks are not a foreign concept in GL (debug output)
• Without callbacks, glClientWaitSync needs a complete
  synchronization stall in threaded dispatch

More oddity in GL fence objects:
• glFenceSync conflates object creation and GPU operation

Suitable for GL_ARB_sync2?
???


Thank you! 


Redundant And Incomplete Data 


Backup/extra slides follow 


Safety First 


You might not want this in Mesa:
• libpthread is required to spawn worker threads
  • loading libpthread switches all mutexes from no-op to real
  • on FreeBSD libpthread cannot be dynamically loaded
• not necessarily a good idea to absorb everything


Higher Hanging Fruit 


In-driver implementation can do a bit better:
• Skip one level of GL dispatch (direct/indirect) in workers
• Skip PLT for API calls in the worker
• Tune code layout for I-cache locality
• Do some state tracking up front (and reuse tracking code)
Pie in the Sky


Interesting potential developments based on fast threaded dispatch
layer:
• Low-overhead GL tracing
• Out-of-process GL
• tee dispatch


